SIGNIFICANCE ANALYSIS OF MICROARRAYS

CROSS REFERENCE TO RELATED APPLICATION

This is a continuation-in-part of U.S. Patent Application Serial No. 60/208,073, filed 
May 4, 2000, which is hereby incorporated by reference in its entirety for all purposes. 

BACKGROUND OF THE INVENTION 

This invention relates in general to statistical analysis of gene-related data and, in
particular, to analysis of microarray data for identifying genes that exhibit statistically
significant behavior. 

Different biological systems are characterized by differences in the copy number of 
genes or in levels of transcription of particular genes. By measuring such biological
phenomena, insight into and possible treatment of human diseases may be found.

Microarrays of various types have been employed for measuring the expression levels
of large numbers of genes. One type of microarray is the oligonucleotide microarray, one
example of which is the GeneChip® microarray manufactured by Affymetrix corporation of
California. International Patent Application PCT/US96/14389, which is incorporated herein
in its entirety, describes a method for measuring gene expression levels using oligonucleotide
microarrays. In the method described, a nucleic acid sample is hybridized to a high density
array of oligonucleotide probes immobilized to a surface, where the high density array
contains oligonucleotide-type probes complementary to sequences of the target nucleic acids
in the nucleic acid sample. For example, RNA transcripts of one or more target genes may
be hybridized to an array of oligonucleotide probes immobilized on a surface such as that of
a semiconductor chip. Some of the probes on the surface have sequences that are perfectly
complementary to particular target sequences and are referred to herein as perfect match
(PM) probes. Also present on the chip are probes whose sequence is deliberately selected
not to be perfectly complementary to a target sequence. Such probes are referred to as
mismatched (MM) control probes, where for each PM probe, there is a MM control probe for
the same particular target sequence. This mismatch may comprise one or more bases. Thus,
a biological sample such as an mRNA sample can be analyzed for gene expression by
hybridization to the above-described microarray on a chip. The presence of RNA sequences that
bind to the oligonucleotide probes on the chip is then detected by methods such as tagging
with a fluorescent material and then detecting the fluorescence. Since sequences that are
different from the target sequences may also bind to the PM probes that correspond to such
target sequences, the fluorescence signals from such sequences would appear as noise.
Signal-to-noise ratio is improved by calculating the difference between the signals from
sequences that bind to the PM probes and the signals from sequences that bind to the MM
probes.
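
By way of illustration only, the PM-MM difference described above may be sketched as follows. The function name average_difference and the intensity values are hypothetical, and averaging the differences over the probe pairs of one gene is an assumption made for this sketch rather than a statement of how any particular analysis software computes expression.

```python
from statistics import mean

def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene.

    The result may be negative when the mismatch probes are brighter
    than the perfect match probes.
    """
    if len(pm) != len(mm):
        raise ValueError("each PM probe needs a matching MM probe")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical fluorescence intensities for four probe pairs of one gene.
pm_signals = [120.0, 95.0, 140.0, 80.0]
mm_signals = [100.0, 99.0, 90.0, 85.0]
print(average_difference(pm_signals, mm_signals))
```

Because the value is a difference, a gene whose MM probes bind more strongly than its PM probes receives a negative expression value.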

Another type of microarray that has been used for analyzing gene expression utilizes
cDNA probes. Although massive amounts of data are generated using oligonucleotide or
cDNA probes, quantitative methods are needed to determine whether differences in gene
expression are experimentally significant. Previous work on microarrays has utilized cluster
analysis to find coherent expression patterns among genes or in cells. See, for example,
the following three articles:

1. Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J.,
Sabet, H., Tran, T., Yu, X., Marti, G., Moore, T., Hudson, J., Lu, L., Lewis, D., Tibshirani,
R., Sherlock, G., Chan, W., Greiner, T., Weisenburger, D., Armitage, K., Levy, R.,
Wilson, W., Grever, M., Byrd, J., Botstein, D., Brown, P. & Staudt, L. (2000) Nature
403, 503-511.

2. Eisen, M., Spellman, P., Brown, P. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA
95, 14863-14868.

3. Weinstein, J., Myers, T., O'Connor, P., Friend, S., Fornace, A., Kohn, K., Fojo, T.,
Bates, S., Rubinstein, L., Anderson, N., Buolamwini, J., van Osdol, W., Monks, A.,
Scudiero, D., Sausville, E., Zaharevitz, D., Bunow, B., Viswanadhan, V., Johnson,
G., Wittes, R. & Paull, K. (1997) Science 275, 343-349.

Cluster analysis works best for a large number of samples. Moreover, cluster 
analysis provides little information about statistical significance. To answer biologically 
important questions, a method is needed which can analyze a relatively small number of 
samples and provide a measure of statistical certainty. Methods based on conventional t-tests
provide the probability (p) that a difference in gene expression occurred by chance. See for 
example, the following articles: 

4. Roberts, C., Nelson, B., Marton, M., Stoughton, R., Meyer, M., Bennett, H., He, Y.,
Dai, H., Walker, W., Hughes, T., Tyers, M., Boone, C. & Friend, S. (2000) Science
287, 873-880.

5. Galitski, T., Saldanha, A., Styles, C., Lander, E. & Fink, G. (1999) Science 285,
251-254.

In conventional t-tests, p = 0.01 may be significant in the context of experiments
designed to evaluate small numbers of genes. However, at that threshold a microarray
experiment for 10,000 genes would identify about 100 genes by chance alone.
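
The arithmetic behind this figure can be written out as a short worked equation. It assumes, for illustration only, that none of the N genes is truly changed and that each gene is tested separately at the same threshold p; the symbol N is introduced here and does not appear elsewhere in this description.

```latex
E[\text{genes called by chance}] = N \times p = 10{,}000 \times 0.01 = 100
```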

One approach for ascertaining the statistical significance of microarray data is known
as the "fold change" method. In this approach, if one were interested in measuring the
effects of radiation on gene expression, a number of biological samples are subjected to
radiation, and their gene expression is then measured. Other biological samples are
measured without being subjected to radiation. The "fold change" method identifies genes as
having been changed significantly by the radiation if the ratio of the average gene expression
measured after being subjected to the radiation to the gene expression measured without
being subjected to radiation is greater than a certain threshold or less than another threshold.
As further explained below, the "fold change" method, in some instances, yields
unacceptably high false discovery rates.
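
For illustration only, the conventional "fold change" rule described above may be sketched as follows. The function name fold_change_calls, the gene names and the expression values are hypothetical, and the thresholds of 2 and 0.5 are example values. Skipping genes whose average expression is not positive is an assumption of this sketch, made because a ratio is not meaningful for such values.

```python
from statistics import mean

def fold_change_calls(induced, uninduced, upper=2.0, lower=0.5):
    """Return the genes whose ratio of average expression crosses a threshold.

    induced and uninduced map a gene name to its list of measurements.
    """
    called = []
    for gene in induced:
        numerator = mean(induced[gene])
        denominator = mean(uninduced[gene])
        if numerator <= 0 or denominator <= 0:
            continue  # ratio not meaningful for zero or negative averages
        ratio = numerator / denominator
        if ratio > upper or ratio < lower:
            called.append(gene)
    return called

# Hypothetical measurements for three genes.
irradiated = {"g1": [200.0, 220.0], "g2": [100.0, 110.0], "g3": [10.0, 20.0]}
unirradiated = {"g1": [100.0, 100.0], "g2": [100.0, 100.0], "g3": [50.0, 50.0]}
print(fold_change_calls(irradiated, unirradiated))
```

Note that the rule uses only the averages, so it takes no account of how much the repeated measurements for a gene fluctuate.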

In one attempt to improve on the "fold change" method, genes are identified to be
significantly changed if a certain fold change is observed consistently between paired
samples. While this yields a moderate improvement over the "fold change" method, this 
improved "pair wise fold change" method still yields a rather high false discovery rate. 

As also noted above, conventional techniques analyze differences in gene expression 
levels, such as PM-MM, so that negative expression values are possible during analysis.
Conventional methods of calculation and graphical representation employ log-log plots
which do not permit negative values. Where linear plots are used instead for representing 
such possible negative values, it is found, however, that most of the values in the plots tend 
to congregate in a small area so that it is difficult to resolve them visually. It is, therefore, 
desirable to provide improved techniques for calculation and representation of data. 

It is, therefore, desirable to provide an improved system for analyzing and
representing data obtained from microarrays whereby the above-described difficulties are
alleviated. 

SUMMARY OF THE INVENTION

A new method, referred to herein as Significance Analysis of Microarrays (SAM), 
identifies genes with statistically significant differences in expression or other biological 
characteristics (such as gene copy number or levels of protein encoded by the genes), 
referred to below as values associated with the genes, by assimilating a set of gene-specific
microarray data. For example, SAM may assign each gene a score representing such 
associated values, based on differences in gene expression or other biological characteristics 
in the data relative to the standard deviation of repeated measurements for that gene. Genes 
with scores greater than an adjustable threshold are deemed potentially significant. In some
situations, gene expression may vary over a wide range of values, so that, in order to take full
advantage of statistical analysis, it is preferable to choose statistical parameters for 
characterizing genes so that statistical significance can be assessed despite such variation of 
values. Preferably the parameters are chosen so that they are substantially independent of the 
ranges of values that characterize the genes. Thus, where a plurality of genes are associated
with a plurality of sets of values obtained from data sources, a statistical parameter is
provided that contains information concerning differences in the associated values of the 
genes among the sets. In one implementation, the parameters of the genes are adjusted so 
that the parameters are substantially independent of the average associated values of the 
genes over the sets. An observed value and an expected value of the adjusted parameter are
calculated and compared to identify genes whose associated values differ by an amount of
statistical significance among the sets. The sets of associated values of genes may be 
obtained from measurements using microarrays, data derived from such measurements, 
calculations or predictions using gene models, or other data sources. 

As noted above, gene expression or other biological characteristics of genes may vary
over a wide range of values. Therefore, for genes whose expression or other characteristics
have high values, even a difference that is a small percentage of the high values may 
overshadow and mask larger relative differences for genes whose expression or other 
characteristics have lower values. Furthermore, factors inherent in the process of acquisition 
of the data analyzed may introduce noise that may mask changes or differences in gene
expression, or cause genes to be erroneously identified as having changes of statistical
significance. This problem can be alleviated by ranking the genes by their values of the 
parameter, and by deriving expected values of the parameter of different ranks. The 
expected value for the parameter for each rank is then compared with the value of the 
parameter of the gene of the same rank to identify genes that exhibit changes of statistical
significance. 

In one embodiment, the expected value for the parameter for each rank is obtained by 
permuting the associated values of genes, deriving a value of such parameter for each gene in 
each permutation, ranking the values of the parameter, and obtaining an average value of the
parameter of each rank for the permutations. 

Inherent in some statistical methods such as the one described above is that some 
genes may be erroneously identified as ones with statistically significant differences in 
expression or other characteristics. A good indication of the effectiveness of the method is to 
compute a false discovery rate for the method.

To estimate the percentage of such genes identified by chance (the false discovery 
rate, FDR), nonsense genes are identified by analyzing permutations of the measurements. 
The threshold score can be adjusted to identify smaller or larger sets of genes, and FDRs are 
calculated for each set. 

The FDR may be found by permuting the associated values of genes, deriving a value
of such parameter for each permutation, ranking the values of the parameter, and comparing
the values of the parameter to a threshold to find the FDR. In one embodiment, this is 
implemented by counting the number of genes with parameter values that exceed a positive 
threshold or fall below a negative threshold. One possible method for estimating the FDR is
to define FDR as the number of such nonsense genes divided by the number of actual genes
with parameter values that exceed the positive threshold or fall below the negative threshold. 
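
The counting procedure just described may be sketched as follows, for illustration only. The names count_beyond and estimate_fdr and all numerical values are hypothetical. Averaging the number of nonsense genes over the permutations before dividing is an assumption of this sketch; the text above specifies only the ratio of nonsense genes to actual genes.

```python
def count_beyond(values, upper, lower):
    """Number of values above the positive threshold or below the negative one."""
    return sum(1 for v in values if v > upper or v < lower)

def estimate_fdr(observed, permuted, upper, lower):
    """Estimated false discovery rate for one pair of thresholds.

    observed: parameter values of the actual genes.
    permuted: one list of parameter values per permutation of the data.
    Returns None when no actual gene is called, since the ratio is undefined.
    """
    if not permuted:
        raise ValueError("at least one permutation is required")
    called = count_beyond(observed, upper, lower)
    if called == 0:
        return None
    nonsense = sum(count_beyond(p, upper, lower) for p in permuted) / len(permuted)
    return nonsense / called

# Hypothetical parameter values for five genes and two permutations.
observed_d = [3.0, -2.5, 0.4, 1.9, -0.2]
permuted_d = [[1.6, 0.1, -0.3, 0.2, 0.0], [0.5, -0.4, 0.3, -1.7, 1.8]]
print(estimate_fdr(observed_d, permuted_d, upper=1.5, lower=-1.5))
```

Raising the thresholds calls fewer genes and typically lowers the estimate, which is how the threshold can be tuned to a tolerable FDR.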

Where SAM is used in connection with data analysis of diseases, gene expression or 
other characteristic values may correlate with patient survival time. In such event, pairs of 
death and risk sets may be defined, each pair having a corresponding patient death time,
where the death set includes associated values corresponding to the death time and the risk
set includes values corresponding to times occurring after the death time. A parameter is 
then provided for each of the genes containing information concerning differences in the 
associated values of the gene among the sets. An observed and an expected value of the 
parameter for each gene are then derived and compared to identify genes that exhibit
behavior of statistical significance.

To avoid the problem inherent in the conventional technique of using sharp 
thresholds in deriving representative values of genes, smooth weighting functions may be 
used to reduce distortion. In order to analyze and/or represent expression levels that may be 
negative or positive in value, odd root values may be analyzed and/or graphically displayed
so that the values do not congregate in a small area in the plot, and this facilitates analysis 
and comparison. 
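
A minimal sketch of such an odd root transformation is given below for illustration. The function name odd_root is hypothetical. Unlike a logarithm, the transformation is defined for negative values and preserves their sign.

```python
import math

def odd_root(value, n=3):
    """Signed n-th root for odd n; defined for negative, zero and positive values."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that negative values keep their sign")
    return math.copysign(abs(value) ** (1.0 / n), value)

# Hypothetical expression values spanning negative, low and high levels.
for x in (-8.0, 0.0, 1.0, 1000.0):
    print(x, odd_root(x))
```

The transformation compresses large values more strongly than small ones, so low and negative expression levels are spread apart in a plot instead of congregating near the origin.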

The above-described features may be embodied as a program of instructions 
executable by computer to perform the above-described different aspects of the invention.
Hence, any of the techniques described above may be performed by means of software 
components loaded into a computer or any other information appliance or digital device. 
When so enabled, the computer, appliance or device may then perform the above-described 
techniques to assist the analysis of sets of values associated with a plurality of genes in the 
manner described above, or for comparing such associated values. The software component
may be loaded from a fixed medium or accessed through a communication medium such as the
internet or any other type of computer network. The above features embodied in one or 
more computer programs may be performed by one or more computers running such 
program(s). 

Each of the inventive features described above may be used individually or in
combination in different arrangements. All such combinations and variations are within the
scope of the invention. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1A is a linear scatter plot of gene expression in a sample hybridized to two
microarrays using a conventional technique, where each gene (i) in the microarray is
represented by a point with coordinates consisting of gene expression measured in uninduced
cell line 1 from hybridization A, x_U1A(i), and gene expression from the same cell line from
hybridization B, x_U1B(i).

Fig. 1B is a cube root scatter plot of gene expression from the data in Figure 1A to
illustrate an aspect of the invention.

Fig. 1C is a cube root scatter plot of average gene expression (avg x_A) from the four
A hybridizations (induced and uninduced, cell lines 1 and 2) and the four similar B
hybridizations (avg x_B) to illustrate an aspect of the invention.

Fig. 1D is a cube root scatter plot of average gene expression from the four
hybridizations with uninduced cells (avg x_U) and induced cells 4 hr after exposure to 5 Gy of
IR (avg x_I), where some of the genes that responded to IR are indicated by arrows to
illustrate an aspect of the invention.

Figs. 2A-2F are scatter plots of relative difference in gene expression d(i) versus
gene-specific scatter s(i), where the data were partitioned to calculate d(i) as indicated by the bar
codes, and where the shaded and unshaded entries were used for the first and second terms in
the numerator of d(i) in Equation 1 set forth below.

Fig. 2A illustrates the relative difference between irradiated and unirradiated states,
where the statistic d(i) was computed from expression measurements partitioned between 
irradiated and unirradiated cells. 

Fig. 2B illustrates the relative difference between cell lines 1 and 2, where the 
statistic d(i) was computed from expression measurements partitioned between cell lines 1 
and 2.

Fig. 2C illustrates the relative difference between hybridizations A and B, where the 
statistic d(i) was computed from the permutation in which the expression measurements were 
partitioned between the equivalent hybridizations A and B. 

Figs. 2D, 2E, 2F illustrate the relative differences for three permutations of the data 
that were balanced between cell lines 1 and 2.

Figs. 3A-3C illustrate a process for identification of genes with significant changes
in expression.

Fig. 3A is a scatter plot of the observed relative difference d(i) versus the expected
relative difference d_E(i), where the solid line at 45 degrees indicates the line for d(i) = d_E(i),
where the observed relative difference is identical to the expected relative difference, and
where the dotted lines are drawn at a distance Δ = 1.2 from the solid line.

Fig. 3B is a scatter plot of d(i) versus scatter s(i).

Fig. 3C is a cube root scatter plot of average gene expression in induced and
uninduced cells, where the cutoffs for 2-fold induction and repression are indicated by the
dashed lines, and where in all panels, the 46 potentially significant genes for Δ = 1.2 are
indicated by the squares.

Figs. 4A-4C illustrate a process for comparison of SAM to conventional methods for 
analyzing microarrays. 

Fig. 4A illustrates falsely significant genes plotted against number of genes called 
significant, where of the 57 genes most highly ranked by the fold change method, 5 were
included among the 46 genes most highly ranked by SAM.

Fig. 4B is a Northern blot validation for genes identified by the fold change method,
where values of r(i) are plotted for genes chosen at random from the 57 genes most highly 
ranked by the fold change method. 

Fig. 4C is a Northern blot validation for genes identified by SAM, where results are 
plotted for genes chosen at random from the 46 genes most highly ranked by SAM. The
straight lines in Figs. 4B and 4C indicate the position of exact agreement between Northern 
blot and microarray results. 

Fig. 5A is a graphical plot of a scatter function to illustrate effects of a conventional
technique for processing gene expression which eliminates contributions from probes that
diverge from a mean value by a predetermined cutoff.

Fig. 5B is a graphical plot of a scatter function to illustrate effects of the use of a 
Gaussian weighting function for processing gene expression to illustrate an aspect of the 
invention. 

Fig. 6 is a block diagram showing a representative sample logic device in which 
aspects of the present invention may be embodied.

For simplicity in description, identical components are labelled by the same numerals
in this application.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

SAM is applied here to the transcriptional response of lymphoblastoid cells to ionizing
radiation (IR), a response chosen because of its biological importance. Although the data
were obtained from
oligonucleotide microarrays representing 6800 genes, SAM can also be applied to cDNA 
microarrays in a similar manner. 

Materials and Methods Used in the Invention:

Preparation of RNA. Lymphoblastoid cell lines GM14660 and GM08925 (Coriell
Cell Repositories, Camden, NJ) were seeded at 2.5 x 10^5 cells/ml and exposed to 5 Gy 24
hours later. RNA was isolated, labeled and hybridized to the HuGeneFL GeneChip®
microarray according to manufacturer's protocols (Affymetrix, Santa Clara, CA).

Microarray hybridization. Each gene in the microarray was represented by 20
oligonucleotide pairs, each pair consisting of an oligonucleotide perfectly matched to the
cDNA sequence and a second oligonucleotide containing a single base mismatch. Because
gene expression was computed from differences in hybridization to the matched and
mismatched probes, expression levels were sometimes reported by the GeneChip® Analysis
Suite software as negative numbers. To compare data from different microarray 
hybridizations, a reference data set was constructed from the average expression for each 
gene over the 8 data sets. Gene expression for each hybridization was plotted against the 
reference data set in a cube root scatter plot and scaled by a linear fit to the data points. Data
were then cubed to return values to the original scale. 
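
One possible reading of this scaling step is sketched below for illustration only. The description above does not state the form of the linear fit, so the least-squares line with an intercept, the direction of the fit, and the function names cbrt, linear_fit and scale_to_reference are all assumptions of this sketch.

```python
import math

def cbrt(value):
    """Signed cube root, so that negative expression values are allowed."""
    return math.copysign(abs(value) ** (1.0 / 3.0), value)

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def scale_to_reference(hybridization, reference):
    """Scale one hybridization onto the reference in cube root space, then cube."""
    ref_roots = [cbrt(v) for v in reference]
    hyb_roots = [cbrt(v) for v in hybridization]
    slope, intercept = linear_fit(ref_roots, hyb_roots)
    # Invert the fitted line, then cube to return to the original scale.
    return [((y - intercept) / slope) ** 3 for y in hyb_roots]

# Hypothetical data: a hybridization that reads 8 times brighter than the reference.
reference = [-8.0, 1.0, 27.0, 125.0]
bright = [8.0 * v for v in reference]
print(scale_to_reference(bright, reference))
```

In this hypothetical case the scaled values come back to approximately the reference values, which is the intended effect of making hybridizations comparable.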

Northern blot hybridization. Total RNA (15 µg) was resolved by agarose gel
electrophoresis, transferred to a nylon membrane, and hybridized to specific DNA probes, 
which were prepared by PCR amplification. 

Results of Applying the Invention to a Biological System: 

RNA was harvested from two wild type human lymphoblastoid cell lines, designated 
1 and 2, growing in an unirradiated state U, or in an irradiated state I, 4 hr after exposure to a 
modest dose of 5 Gy of IR. RNA samples were labeled and divided into two identical
aliquots for independent hybridizations, A and B. Thus, data were generated from eight
hybridizations (U1A, U1B, U2A, U2B, I1A, I1B, I2A, I2B).

To assess reproducibility in the data, identical aliquots of an mRNA sample (U1A 
and U1B) were analyzed with two microarrays from the same manufacturing lot. A linear 
scatter plot for gene expression confirmed that the data were generally reproducible (Figure
1A), but failed to resolve the vast majority of genes that are expressed at low levels. To better
resolve these genes, we chose to display the data in a cube root scatter plot. This permitted
the inclusion of negative levels of expression that are sometimes generated by the
GeneChip® software. The cube root scatter plot (Figure 1B) revealed three salient features:
the large percentage of genes (24%) assigned negative levels of expression, the large
percentage of genes with low levels of expression, and the low signal to noise ratio at low
levels of expression. 

Figure 1A is a linear scatter plot of gene expression in a sample hybridized to two
microarrays using a conventional technique, where each gene (i) in the microarray is
represented by a point with coordinates consisting of gene expression measured in uninduced
cell line 1 from hybridization A, x_U1A(i), and gene expression in the same cell line from
hybridization B, x_U1B(i). As can be observed from Figure 1A, only a small number of highly
expressed genes are resolved visually, with most of the genes compressed into a small region
of the plot so that they would be difficult to resolve visually. One method of distributing
such data points more uniformly is a logarithmic scatter plot, but the log function cannot
accept the negative values for gene expression generated by the microarrays. Figure 1B is a
cube root scatter plot of gene expression from the data in Figure 1A to illustrate an aspect of
the invention. As will be clear from a comparison of Figures 1A and 1B, the genes with lower
expression levels are better resolved visually in the cube root plot of Figure 1B than in Figure
1A. While cube root plots are illustrated herein, it will be understood that the fifth root or
other odd root plots may be used instead and are within the scope of the invention.

After scaling the data from different microarray hybridizations, a scatter plot was 
generated for average gene expression in the four A aliquots vs. the average in the four B 
aliquots, a partitioning of the data that eliminates biological changes in gene expression. The 
scatter was improved by averaging multiple data sets (compare Figures 1B and 1C). Figure
1C is a cube root scatter plot of average gene expression from the four A hybridizations (avg
x_A) and the four B hybridizations (avg x_B).

To assess the biological effect of IR, a scatter plot was generated for average gene 
expression in the four irradiated states vs. the four unirradiated states (compare Figures 1C
and 1D). Figure 1D is a cube root scatter plot of average gene expression from the four
hybridizations with uninduced cells (avg x_U) and induced cells 4 hr after exposure to 5 Gy of
IR (avg x_I), where some of the genes that responded to IR are indicated by arrows to
illustrate an aspect of the invention. A few of the potentially significant changes in gene
expression are indicated by arrows in Figure 1D, but the effect was not easily quantified, and
it is desirable to provide a better method to identify changes with a level of statistical 
confidence. 

The approach adopted herein was based on analysis of random fluctuations in the 
data. In general, the signal to noise ratio decreased with decreasing gene expression (Figures 
1B-1D). However, even for a given level of expression, fluctuations were found to be
gene specific. To account for gene-specific fluctuations, a statistic is defined based on the 
ratio of change in gene expression to standard deviation in the data for that gene. The 
"relative difference" d(i) in gene expression is: 



    d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}    (1)

where \bar{x}_I(i) and \bar{x}_U(i) are defined as the average levels of expression for gene (i) in states I
and U, respectively. The "gene-specific scatter" s(i) is the standard deviation in the data:

    s(i) = \left( \left\{ \sum_m [x_m(i) - \bar{x}_I(i)]^2 + \sum_n [x_n(i) - \bar{x}_U(i)]^2 \right\} \frac{1/n_1 + 1/n_2}{n_1 + n_2 - 2} \right)^{1/2}    (2)

where \sum_m and \sum_n are summations of the expression measurements in states I and U,
respectively, and n_1 and n_2 are the numbers of measurements in states I and U (4 in this
experiment). A constant s_0 = 3.3 was chosen by minimizing the coefficient of variation of the
standard deviation of d(i) as a function of s(i), thus permitting d(i) values to be compared
among all genes in the microarray. While a relative difference parameter d(i) as set forth in
equation (1) is preferable, it will be understood that other difference functions that depend on
the differences between the associated values of the genes among the sets (e.g. set of
measurements in state U and set of same in state I) and on scatter values among the sets may
be used and are within the scope of the invention.
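
For illustration only, equations (1) and (2) may be sketched in code as follows. The function names gene_scatter and relative_difference and the measurement values are hypothetical; the default s0 = 3.3 is the value reported above for this particular experiment and would not be expected to carry over to other data.

```python
import math
from statistics import mean

def gene_scatter(state_i, state_u):
    """Gene-specific scatter s(i) of equation (2) for one gene."""
    n1, n2 = len(state_i), len(state_u)
    mean_i, mean_u = mean(state_i), mean(state_u)
    squares = (sum((x - mean_i) ** 2 for x in state_i)
               + sum((x - mean_u) ** 2 for x in state_u))
    factor = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    return math.sqrt(factor * squares)

def relative_difference(state_i, state_u, s0=3.3):
    """Relative difference d(i) of equation (1) for one gene."""
    return (mean(state_i) - mean(state_u)) / (gene_scatter(state_i, state_u) + s0)

# Hypothetical measurements for one gene: four irradiated, four unirradiated.
irradiated = [10.0, 12.0, 14.0, 16.0]
unirradiated = [4.0, 6.0, 8.0, 10.0]
print(gene_scatter(irradiated, unirradiated))
print(relative_difference(irradiated, unirradiated))
```

The constant s0 in the denominator keeps d(i) from becoming very large for genes whose scatter happens to be close to zero.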

As noted above, factors inherent in the process of acquisition of microarray data itself 
may introduce noise that renders it difficult to discover the significance of differences in 
gene expression or other biological behavior or falsely identify genes to be of statistical
significance. To overcome such problem, a number of methods are described above which 
allow full utilization of the microarray data. One difficulty in making use of the microarray 
data is due to the fact that the expression levels of the genes have a wide range of values or 
scattered values. It is, therefore, desirable to adjust the parameter d(i) so that it is essentially 
independent of the wide variation of the values of the parameter d(i) and/or of the scatter
value s(i). After the parameter has been so adjusted, then all of the data can be fully utilized. 

In one embodiment, the adjustment is accomplished by dividing the scatter values or 
average associated values of the genes into subsets each having a similar range of values. 
For example, the scatter values or average associated values of the genes may be divided into 
ten subsets in accordance with which percentile such values fall into. In other words, the
first of the ten subsets will contain the top tenth percentile of the scatter values or average 
associated values of the genes, the second subset containing the second to the top tenth 
percentile of such values and so on. The standard deviation of the parameter d(i) is then 
calculated within each subset and a coefficient of variation of the standard deviations of the 
parameter values for the ten subsets is then minimized by varying the value of the constant s_0
appearing in equation 1. After the constant s_0 has been so adjusted, the parameter d(i) is then
substantially independent of wide variations in scatter values or average associated values of 
the genes, so that all of the microarray data can be effectively used. 
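
The adjustment of the constant s_0 described above may be sketched as follows, for illustration only. The function names, the use of a fixed list of candidate values instead of a continuous minimization, the use of equal-sized subsets ordered by scatter, and the choice of the population standard deviation are assumptions of this sketch.

```python
from statistics import mean, pstdev

def cv_of_subset_sds(diffs, scatters, s0, n_subsets=10):
    """Coefficient of variation of the per-subset standard deviations of d(i).

    diffs holds the numerator of equation (1) for each gene and scatters
    holds s(i); genes are grouped into subsets of similar scatter.
    """
    if len(diffs) < 2 * n_subsets:
        raise ValueError("need at least two genes per subset")
    order = sorted(range(len(scatters)), key=lambda i: scatters[i])
    size = len(order) // n_subsets
    sds = []
    for k in range(n_subsets):
        # The last subset absorbs any remainder.
        end = (k + 1) * size if k < n_subsets - 1 else len(order)
        subset = order[k * size:end]
        sds.append(pstdev([diffs[i] / (scatters[i] + s0) for i in subset]))
    centre = mean(sds)
    return pstdev(sds) / centre if centre > 0 else float("inf")

def choose_s0(diffs, scatters, candidates, n_subsets=10):
    """Candidate s0 giving the smallest coefficient of variation."""
    return min(candidates, key=lambda s0: cv_of_subset_sds(diffs, scatters, s0, n_subsets))

# Hypothetical data built so that s0 = 2.0 makes d(i) equally spread in every subset.
scatters = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
signs = [-1.0, 1.0] * 4
diffs = [z * (s + 2.0) for z, s in zip(signs, scatters)]
print(choose_s0(diffs, scatters, [0.0, 1.0, 2.0, 4.0], n_subsets=4))
```

With the selected constant, the spread of d(i) is about the same for genes of low and high scatter, which is what allows d(i) values to be compared across all genes.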

Scatter plots of d(i) vs. log[s(i)] are shown in Figures 2A-2F, which are scatter plots of
relative difference in gene expression d(i) versus gene-specific scatter s(i), where the data
were partitioned to calculate d(i) as indicated by the bar codes, and where the shaded and
unshaded entries were used for the first and second terms in the numerator of d(i) in Equation
1 set forth above. Figure 2A illustrates the relative difference between irradiated and
unirradiated states, where the statistic d(i) was computed from expression measurements
partitioned between irradiated and unirradiated cells. By contrast, Figure 2B, which
illustrates the relative difference between cell lines 1 and 2, shows more marked changes.
In Figure 2B, the statistic d(i)
was computed from expression measurements partitioned between cell lines 1 and 2. Thus,
the relative difference between cell lines 1 and 2 appears to exceed that between irradiated 
and unirradiated states. 

These relative differences exceeded random fluctuations in the data, as measured by 
the relative difference between hybridizations A and B illustrated in Figure 2C. In Figure
2C, the statistic d(i) was
computed from the permutation in which the expression measurements were partitioned 
between the equivalent hybridizations A and B. 

Although the relative difference computed from hybridizations A and B provided a 
control for random fluctuations, additional controls were desirable to assign statistical
significance to the biological effect of IR. Instead of performing more experiments, which
are expensive and labor-intensive, a large number of controls are generated by computing 
relative differences from permutations of the hybridizations for the 4 irradiated and 4 
unirradiated states. To minimize potentially confounding effects from differences between 
the two cell lines, the data were analyzed using the 36 permutations that were balanced for
cell lines 1 and 2. Permutations were defined as balanced when each group of four
experiments contained two experiments from cell line 1 and two experiments from cell line 
2. Figures 2D, 2E, 2F illustrate the relative differences for three permutations of the data that 
were balanced between cell lines 1 and 2. 
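
The count of 36 balanced permutations can be checked by enumeration, as sketched below for illustration. Choosing two of the four hybridizations from each cell line for the first group gives 6 x 6 = 36 groupings. The function name balanced_permutations is hypothetical; the hybridization labels are those listed above.

```python
from itertools import combinations

LINE_1 = ["U1A", "U1B", "I1A", "I1B"]  # hybridizations from cell line 1
LINE_2 = ["U2A", "U2B", "I2A", "I2B"]  # hybridizations from cell line 2

def balanced_permutations():
    """Yield (group_1, group_2) splits of the eight hybridizations.

    Each group of four holds two hybridizations from each cell line.
    """
    everything = LINE_1 + LINE_2
    for pick_1 in combinations(LINE_1, 2):
        for pick_2 in combinations(LINE_2, 2):
            group_1 = pick_1 + pick_2
            group_2 = tuple(h for h in everything if h not in group_1)
            yield group_1, group_2

splits = list(balanced_permutations())
print(len(splits))
```

The enumeration includes the grouping in which all irradiated hybridizations fall in one group, which corresponds to the actual partition of the data.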

Relative differences from random permutations of the hybridizations indicate noise
inherent in the process of data acquisition. From the examples illustrated above, it is seen
that relative differences stemming from the differences between cell lines may mask 
statistically significant changes in gene expression caused by radiation, so that for this 
reason, it may be preferable to use only data from balanced permutations, to reduce the
effects on the statistics from differences between the cell lines. 

Another control that can be exerted is by ranking the values of the relative difference 
parameter d(i). Although gene expression levels can vary widely, the relative difference d(i)
is a measure of statistical significance substantially independent of expression level. As
another control for assigning statistical significance, the largest relative differences from the 
36 permutations may indicate noise from statistical fluctuations in the data. One may 
compute the average value of the largest relative differences from all 36 permutations. Thus, 
comparing the largest relative difference among all the genes to the largest relative 

10 differences from the permutations provides one possible test for identifying genes to be of 
statistical significance. Therefore, the average of the largest relative differences from the 36 
permutations is the expected relative difference for such gene. A comparison of the relative 
difference of such gene with its expected value can be used as control as to whether 
statistical significance should be assigned to such gene. The same reasoning applies to the 

15 gene of the second highest relative difference and comparison to the second largest relative 
differences from the permutations, and so on for all the genes involved in the calculation. 

In other words, to find significant changes in gene expression, genes were ranked by
magnitude of their d(i) values, so that d(1) is the largest relative difference, d(2) is the second
largest relative difference, and d(i) is the ith largest relative difference, or the ith rank. For
each of the 36 balanced permutations, relative differences d_p(i) were also calculated, and the
genes were again ranked such that d_p(i) was the ith largest relative difference for permutation p.
The expected relative difference d_E(i) was defined as the average over the 36 balanced permutations,

$$ d_E(i) = \sum_p d_p(i) / 36 \tag{3} $$
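
The averaging in equation (3) may be sketched as follows. This is a minimal illustration that assumes the relative differences for every permutation have already been computed and stored in an array of shape (permutations, genes); the function name `expected_relative_difference` is hypothetical.

```python
import numpy as np

def expected_relative_difference(d_perm):
    """d_perm: array of shape (n_permutations, n_genes) holding the relative
    differences computed from each permutation, in any gene order.
    Returns d_E, where d_E[i] is the average over permutations of the
    (i+1)th largest relative difference."""
    d_perm = np.asarray(d_perm, dtype=float)
    ranked = -np.sort(-d_perm, axis=1)   # rank each permutation from largest to smallest
    return ranked.mean(axis=0)           # average rank by rank across permutations

# Toy data: 2 permutations, 3 genes (36 permutations would be used in practice).
d_perm = [[0.5, -1.0, 2.0],
          [1.5, 0.0, -2.0]]
d_E = expected_relative_difference(d_perm)
print(d_E)  # [ 1.75  0.25 -1.5 ]
```

Note that the ranking is performed separately within each permutation, so the gene occupying a given rank may differ from one permutation to the next.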

Figures 3A-3C illustrate a process for identification of genes with significant changes in
expression. Figure 3A is a scatter plot of the observed relative difference d(i) versus the
expected relative difference d_E(i), in which the solid line indicates the line d(i) = d_E(i),
where the observed relative difference is identical to the expected relative difference, and in
which the dotted lines are drawn at a distance Δ = 1.2 from the solid line. Figure 3B is a
scatter plot of d(i) versus scatter s(i). Figure 3C is a cube root scatter plot of average gene
expression in induced and uninduced cells, where the cutoffs for 2-fold induction and
repression are indicated by the dashed lines. In all panels, the 46 potentially
significant genes for Δ = 1.2 are indicated by the squares.

To identify potentially significant changes in expression, a scatter plot of the
observed relative difference d(i) vs. the expected relative difference d_E(i) (Figure 3A) is used.
For the vast majority of genes, d(i) ≈ d_E(i). However, some genes are represented by points
displaced from the d(i) = d_E(i) line by a distance greater than a threshold Δ. For example, the
threshold Δ = 1.2 illustrated by the broken lines in Figure 3A yielded 46 genes that were
"called significant." These 46 genes are shown in the context of the scatter plot for d(i) vs.
log[s(i)] (Figure 3B) and in the scatter plot for the cube root of gene expression x_I(i) vs. x_U(i)
(Figure 3C). Clearly, genes identified by d(i) do not necessarily have the largest changes in
gene expression.

As noted above, the relative differences of the various permutations indicate noise
inherent in the data acquisition process. Such relative differences may then be used to
estimate the number of genes falsely identified as statistically significant. The false
discovery rate (FDR) may be found by comparing such relative differences to thresholds. Fig. 3A
may be used for such purposes as well, where the "observed" relative difference d(i) in the
figure is one obtained from permutations as described below.

In one embodiment, to determine the number of falsely significant genes generated by
SAM, horizontal cutoffs were defined as the smallest d(i) among the genes called
significantly induced and the least negative d(i) among the genes called significantly
repressed. The number of falsely significant genes corresponding to each permutation was
computed by counting the number of genes that exceeded the horizontal cutoffs for induced
and repressed genes. The estimated number of falsely significant genes was the average of
the number of genes called significant from all 36 permutations. Table 1, attached hereto as
Appendix A and made part of this application, shows the results for different values of Δ. For
Δ = 1.2, the permuted data sets generated an average of 8.4 falsely significant genes,
compared to 46 genes called significant, yielding an estimated FDR of 18%. As Δ decreased,
the number of genes called significant by SAM increased, but at the cost of an increasing
FDR. (Omitting s_0 from Equation 1 produced higher FDRs of 45%, 35%, and 28% for Δ =
0.6, 0.9, and 1.2, respectively.)
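
As a check on the arithmetic, the FDR values reported for SAM in Table 1 can be recomputed as the ratio of falsely significant genes to genes called significant. The short sketch below uses only the figures given in Table 1; the helper name `fdr_percent` is hypothetical.

```python
# (threshold, average number of falsely significant genes, genes called significant),
# taken from the SAM rows of Table 1.
sam_rows = [(0.4, 134.9, 288), (0.5, 78.1, 192), (0.6, 56.1, 162),
            (0.9, 19.1, 80), (1.2, 8.4, 46)]

def fdr_percent(falsely_significant, called_significant):
    """False discovery rate expressed as a percentage."""
    return 100.0 * falsely_significant / called_significant

for threshold, n_false, n_called in sam_rows:
    print(f"threshold {threshold}: FDR = {fdr_percent(n_false, n_called):.0f}%")
# prints 47%, 41%, 35%, 24%, 18%
```

For the threshold of 1.2, for example, 8.4 / 46 is approximately 0.18, i.e. the 18% quoted above.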

Thus, as illustrated in Fig. 3A, the "observed" relative difference d(i) is plotted
against the expected relative difference d_E(i) for each of the 36 permutations. To arrive at the plot
in Fig. 3A, both the "observed" and the expected relative differences are computed from the
associated values of the genes in the 36 permutations using equations (1)-(3) above.

One then starts from point 12 (at coordinates (0,0)) in the plot and proceeds
along line 14, at 45° to the axes, in the positive direction along arrow 16. When the smallest
positive "observed" relative difference d(i) is encountered that exceeds the expected relative
difference d_E(i) by a set threshold defined by dotted line 17, such as at point 20, that value of
d(i) is set as a horizontal cutoff 22. The number of genes from the 36 permutations with
positive "observed" relative difference values exceeding cutoff 22, compared to the number
from the unpermuted data, provides an indication of the false discovery rate for induced
genes. This accounts for the falsely significant genes that are induced.

To discover the falsely significant genes that are repressed, one proceeds
again from point 12 along line 14 but in the negative direction 18 until one
encounters, at point 30, the least negative observed relative difference d(i) that falls below the
expected relative difference d_E(i) by a set threshold indicated by dotted line 19. This
least negative d(i) is then set as the negative horizontal cutoff 32. The genes
whose relative differences are more negative than horizontal cutoff 32 in the
permuted and unpermuted data are used to estimate the FDR.
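
The cutoff and FDR procedure described above may be sketched as follows. This is an illustrative sketch only, under stated assumptions: the threshold is passed as `delta`, displacement from the d(i) = d_E(i) line is measured vertically, the positive and negative directions are separated by the sign of d_E(i), and the function names (`sam_cutoffs`, `count_called`, `estimate_fdr`) are hypothetical.

```python
import numpy as np

def sam_cutoffs(d_obs, d_exp, delta):
    """d_obs and d_exp are ranked from largest to smallest and paired by rank.
    Returns (upper, lower) horizontal cutoffs; either may be None."""
    d_obs = np.asarray(d_obs, dtype=float)
    d_exp = np.asarray(d_exp, dtype=float)
    induced = d_obs[(d_exp >= 0) & (d_obs - d_exp > delta)]
    repressed = d_obs[(d_exp < 0) & (d_exp - d_obs > delta)]
    upper = induced.min() if induced.size else None      # smallest d(i) called induced
    lower = repressed.max() if repressed.size else None  # least negative d(i) called repressed
    return upper, lower

def count_called(d, upper, lower):
    """Number of genes at or beyond the horizontal cutoffs."""
    d = np.asarray(d, dtype=float)
    n = 0
    if upper is not None:
        n += int(np.sum(d >= upper))
    if lower is not None:
        n += int(np.sum(d <= lower))
    return n

def estimate_fdr(d_obs, d_perm, delta):
    """d_obs: observed d(i) per gene; d_perm: (n_permutations, n_genes) array.
    Returns (genes called, average falsely significant genes, FDR)."""
    d_obs = -np.sort(-np.asarray(d_obs, dtype=float))
    d_perm = np.asarray(d_perm, dtype=float)
    d_exp = (-np.sort(-d_perm, axis=1)).mean(axis=0)
    upper, lower = sam_cutoffs(d_obs, d_exp, delta)
    called = count_called(d_obs, upper, lower)
    false_mean = float(np.mean([count_called(row, upper, lower) for row in d_perm]))
    fdr = false_mean / called if called else 0.0
    return called, false_mean, fdr
```

Because the two cutoffs are found independently, their magnitudes need not be equal.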

To test the above-described method for determining the FDR, artificial data sets were
constructed in which a subset of genes was induced over a background of noise. When SAM
was used to analyze such data sets, the estimated FDR accurately predicted the
number of falsely significant genes.

The above method for setting thresholds provides asymmetric cutoffs for induced and
repressed genes. In other words, the magnitudes of the two horizontal cutoffs 22, 32 need
not be the same. The alternative is the standard t-test, which imposes a symmetric horizontal
cutoff, with d(i) > c for induced genes and d(i) < -c for repressed genes. However, the
asymmetric cutoff is preferred because it allows for the possibility that d(i) for induced and
repressed genes may behave differently in some biological experiments.

Figures 4A-4C illustrate a comparison of SAM to conventional methods
for analyzing microarrays. Figure 4A illustrates falsely significant genes plotted against
number of genes called significant. Of the 57 genes most highly ranked by the fold
change method, 5 were included among the 46 genes most highly ranked by SAM. Of the 38
genes most highly ranked by the pairwise fold change method, 11 were included among the
46 genes most highly ranked by SAM. These results were consistent with the FDRs of SAM
compared to the fold change and pairwise fold change methods.

Figure 4B is a Northern blot validation for genes identified by the fold change
method, where values of r(i) are plotted for genes chosen at random from the 57 genes most
highly ranked by the fold change method. The genes are: cyclin F (1); parathymosin (2);
N-acetyl glucosaminyltransferase (3); eIF-4 gamma (4); dynamin (5); interferon consensus
sequence binding protein (6); heart muscle specific protein DRAL/SLIM3/FHL-2 (7); U1
snRNP-specific C protein (8); and maxi K potassium channel beta subunit (9).

Figure 4C is a Northern blot validation for genes identified by SAM, where results
are plotted for genes chosen at random from the 46 genes most highly ranked by SAM: maxi
K potassium channel beta subunit (9); cyclin B (10); PLK (11); ckshs2 (12); IL2 receptor
beta chain (13); PTP(CAAX1) (14); p48 (15); XPC (16); Fas (17); and mdm2 (18).

SAM proved to be superior to conventional methods for analyzing microarrays (Table
1 and Figure 4A). First, SAM was compared to the approach of identifying genes as
significantly changed if an R-fold change was observed. In this "fold change" method, r(i) =
x_I(i)/x_U(i), and gene (i) was called significantly changed if r(i) > R or r(i) < 1/R. To permit
computation of r(i) from negative values for gene expression, x_I(i) and x_U(i) were set
to 10 when their values were negative or less than 10. This procedure yielded
unacceptably high FDRs of 73% to 84%.

Another approach attempts to account for uncertainty in the data by identifying genes
as significantly changed if an R-fold change is observed consistently between paired samples
(7). To apply this "pairwise fold change" method to our 4 data sets before and 4 data sets
after IR, changes in gene expression were declared significant if 12 of 16 pairings satisfied
the criteria r(i) > R or r(i) < 1/R. Despite the demand for consistent changes between paired
samples, this method yielded FDRs of 60% to 71%.
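
For comparison, the two fold-change criteria described above may be sketched as follows. The sketch rests on assumptions that are not stated in the text: the floor of 10 is also applied to the individual samples in the pairwise method, and a gene is called by the pairwise method only when at least 12 of the 16 pairings change in the same direction. The function names are hypothetical.

```python
import numpy as np

FLOOR = 10.0  # values that are negative or below 10 are set to 10

def fold_change_calls(x_i, x_u, R):
    """x_i, x_u: average expression per gene in the irradiated and
    unirradiated states. Returns (r, called)."""
    num = np.maximum(np.asarray(x_i, dtype=float), FLOOR)
    den = np.maximum(np.asarray(x_u, dtype=float), FLOOR)
    r = num / den
    return r, (r > R) | (r < 1.0 / R)

def pairwise_fold_change_calls(after, before, R, min_pairs=12):
    """after, before: arrays of shape (n_genes, 4). All 4 x 4 = 16 pairings
    of an 'after' sample with a 'before' sample are formed for each gene."""
    a = np.maximum(np.asarray(after, dtype=float), FLOOR)[:, :, None]
    b = np.maximum(np.asarray(before, dtype=float), FLOOR)[:, None, :]
    r = (a / b).reshape(a.shape[0], -1)      # shape (n_genes, 16)
    up = (r > R).sum(axis=1)                 # pairings showing induction
    down = (r < 1.0 / R).sum(axis=1)         # pairings showing repression
    return (up >= min_pairs) | (down >= min_pairs)
```

For example, with x_U(i) = -4 and x_I(i) = 90 the floor gives r(i) = 90/10 = 9.0, which matches the starred entry for that gene in Table 2.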

To understand why fold-change methods fail, note that the vast majority of genes are
expressed at low levels where the signal to noise ratio is very low (Figure 3C). Thus, 2-fold
changes in gene expression occur at random for a large number of genes. Conversely, for
higher levels of expression, smaller changes in gene expression may be real, but these
changes are rejected by fold-change methods. The pairwise fold change method provided
modest improvement but remained inferior to SAM.

Of the 46 genes most highly ranked by SAM (Δ = 1.2), 36 increased or decreased at
least 1.5-fold with r(i) > 1.5 or r(i) < 0.67. The number of falsely significant genes that met
these two criteria was 4.5, corresponding to an FDR of 12%. Fas was identified 3 times as
alternately spliced forms, leaving 34 independent genes. As an indication of biological
validity, 10 of the 34 genes have been reported in the literature as part of the transcriptional
response to IR. TNF-α was reported to be induced by others under different conditions (8)
but was repressed here. We validated our microarray result by Taq-Man PCR.

To test the validity of SAM directly, Northern blots were performed for genes that
were randomly selected from the 46 and 57 genes most highly ranked by SAM (Δ = 1.2) and
the fold change method (at least 3.6-fold change), respectively. Northern blots showed little
correlation with the genes identified by the fold change method (Figure 4B), but strong
correlation with the genes identified by SAM (Figure 4C). Indeed, Northern blots
contradicted only 1 of 10 genes identified by SAM, consistent with our estimated FDR.

Nineteen of the 34 genes most highly ranked by SAM appear to be involved in the
cell cycle. Three are known to be induced in a p53-dependent manner: p21, cyclin G1, and
mdm2 (9-11). Six cell cycle genes were repressed: ubiquitin carrier protein E2-EPF, p55cdc,
cyclin B, ckshs2, cdc25 phosphatase, and wee1 kinase (12, 13). Five genes encoding the
mitotic machinery were also repressed: PLK kinase, mitotic kinesin-like protein 1 (MKLP-1),
mitotic centromere-associated kinesin (MCAK), cdc25 associated protein kinase (C-TAK1),
and the kinetochore motor CENP-E (14-16). Four genes involved in cell proliferation
were induced or repressed: the farnesylated protein tyrosine phosphatase PTP(CAAX1),
OX40 ligand, lymphocyte phosphatase associated phosphoprotein (LPAP), and c-myc (17-21).
Some responses were paradoxical. For example, cdc25 phosphatase and wee1 kinase
have antagonistic effects on the phosphorylation state of cdc2, but both genes were repressed.
Repression of these genes together with the mitotic genes may represent a damage response
that dismantles the cell cycle machinery until the cell has repaired the damaged DNA.

Four of the 34 genes play roles in DNA repair, but none are involved in the repair of
IR-induced double-strand breaks. Instead, the genes (p48, XPC, gadd45, PCNA) have roles
in nucleotide excision repair, a pathway conventionally associated with UV-induced damage
(22-25). We confirmed the induction of these genes by Northern blot (26-28). Fornace et al.
reported defective removal of base damage induced by IR in xeroderma pigmentosum cells
(29). Leadon et al. reported that a novel DNA repair pathway involving long excision repair
patches of at least 150 nucleotides is activated by IR, but not UV (30). Our results suggest
that this novel pathway might include p48, XPC, gadd45, and PCNA.

Three of the 34 genes play roles in apoptosis (Fas, bcl-2 binding component 3, TNF-α).
The remaining genes may have previously unsuspected roles in the DNA damage
response, or may be among the estimated set of four falsely detected genes. Attached hereto
as Appendix B and made a part of this application is Table 2, which sets forth the genes with
changes in expression called significant by SAM.

Discussion 

The 34 genes most highly ranked by SAM are only a subset of all the genes that
change 1.5-fold with IR. The difference between the number of genes called significant and
the number of falsely significant genes was calculated for decreasing values of Δ = 0.3, 0.2,
and 0.1, and was found to be 92, 170, and 184, respectively. Thus SAM suggests
that at least 180 genes are induced or repressed by 5 Gy IR.

In conclusion, SAM successfully identified those genes on a microarray with bona
fide changes in expression. Here, SAM found genes whose expression changed between two
states. SAM can also be generalized to other types of experiments by expressing d(i) in other
ways. Suppose the data includes gene expression x_j(i) and a response parameter y_j, in which i
= 1, 2, ..., m genes and j = 1, 2, ..., n samples. The generalized statistical parameter still takes the
form d(i) = r(i) / [s(i) + s_0]. Only the definitions of r(i) and s(i) change. For example, r(i) can
be correlated with factors other than irradiation, such as different types of tumors or survival
time, as described in more detail below, where r(i) simply indicates relative differences in
associated values, not necessarily those caused by changes due to radiation.

To identify genes whose expression is specifically different in a subset of a set of
samples, the parameter d(i) is defined in terms of Fisher's linear discriminant. One goal
might be to identify genes whose expression in one type of tumor is different from its
expression in other types of tumors. Suppose that a set of n samples consists of K
non-overlapping subsets, with y_j ∈ {1, ..., K}. Define C(k) = {j : y_j = k}. Let n_k = number of
observations in C(k). The average gene expression in each subset is
$\bar{x}_k(i) = \sum_{j \in C(k)} x_j(i) / n_k$ and the average gene expression for all n samples
is $\bar{x}(i) = \sum_j x_j(i) / n$. Then define:

$$ r(i) = \left\{ \left[ \sum_k n_k \Big/ \prod_k n_k \right] \sum_k n_k \left[ \bar{x}_k(i) - \bar{x}(i) \right]^2 \right\}^{1/2} \tag{4} $$

$$ s(i) = \left\{ \left[ \sum_k (1/n_k) \Big/ \sum_k (n_k - 1) \right] \sum_k \sum_{j \in C(k)} \left[ x_j(i) - \bar{x}_k(i) \right]^2 \right\}^{1/2} \tag{5} $$

The quantity r(i) in equation 4 is the variance between subsets, and the quantity s(i) in
equation 5 is the sum of variances within each subset. Each subset may be data collected from
a type of tumor. Thus a large value for the generalized statistical parameter d(i) indicates a
difference in gene expression between subsets, or between the different types of tumors. The
value of s_0 in d(i) is adjusted in a manner similar to that above by permuting the parameter k
among the tumor subsets.
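
A minimal sketch of the multi-subset parameter is given below. It assumes the between-subset and within-subset quantities take the forms of equations (4) and (5), namely r(i) = {[Σ n_k / Π n_k] Σ n_k [mean_k(i) - mean(i)]²}^(1/2) and s(i) = {[Σ (1/n_k) / Σ (n_k - 1)] Σ_k Σ_j [x_j(i) - mean_k(i)]²}^(1/2); the function name `multiclass_d` and the array layout are hypothetical.

```python
import numpy as np

def multiclass_d(x, y, s0=0.0):
    """x: (n_genes, n_samples) expression values; y: subset label per sample.
    Returns d(i) = r(i) / [s(i) + s0] for every gene."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_k = np.array([(y == k).sum() for k in classes], dtype=float)
    grand_mean = x.mean(axis=1)
    between = np.zeros(x.shape[0])
    within = np.zeros(x.shape[0])
    for k, n in zip(classes, n_k):
        xk = x[:, y == k]
        mean_k = xk.mean(axis=1)
        between += n * (mean_k - grand_mean) ** 2                 # between-subset term
        within += ((xk - mean_k[:, None]) ** 2).sum(axis=1)       # within-subset term
    r = np.sqrt(n_k.sum() / n_k.prod() * between)
    s = np.sqrt((1.0 / n_k).sum() / (n_k - 1.0).sum() * within)
    return r / (s + s0)

# Toy example: one gene, two subsets of two samples each.
d = multiclass_d([[1.0, 3.0, 5.0, 7.0]], [0, 0, 1, 1])
```

With K = 2 subsets and s_0 = 0, this quantity reduces to the absolute value of the pooled-variance two-sample t-statistic, which gives a convenient sanity check.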

Thus, in general, where the associated values in each set can be classified into two or
more subsets with values in each subset having a correlation with one another, a parameter
may be selected using a quantity related to variances between the associated values in the
subsets of the sets and the variances of the associated values within each subset of the sets.
The quantity may relate to the sum of the variances between the associated values in the
subsets of the sets and the sum of variances of the associated values within each subset of the
sets.

To identify genes whose expression correlates with survival time, d(i) is defined in
terms of Cox's proportional hazards function. Express the response data in the form y_j =
(t_j, δ_j). Here, t_j = survival time for patient (j), or censored survival time if the patient is still
alive or lost to follow-up, and δ_j = 0 or 1, depending on whether patient (j) was censored (δ_j =
0) or died with a known survival time t_j (δ_j = 1). Assume that there are K unique death times
z_1, z_2, ..., z_K. Let D(k), for k = 1, ..., K, be death sets D(k) = {j : t_j = z_k}. Let R(k) be risk sets
R(k) = {j : t_j ≥ z_k}. Let m_k = number of patients in R(k). Let d_k = number of deaths at time z_k.
The average expression of gene (i) in death set D(k) is $x^*_k(i) = \sum_{j \in D(k)} x_j(i) / d_k$. The average
expression of gene (i) in risk set R(k) is $\bar{x}_k(i) = \sum_{j \in R(k)} x_j(i) / m_k$. Then define:

$$ r(i) = \sum_k d_k \left[ x^*_k(i) - \bar{x}_k(i) \right] \tag{6} $$

$$ s(i) = \left\{ \sum_k (d_k / m_k) \sum_{j \in R(k)} \left[ x_j(i) - \bar{x}_k(i) \right]^2 \right\}^{1/2} \tag{7} $$
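
Equations (6) and (7) may be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the risk set at a death time is taken to include every patient whose time is at least that death time, patients censored exactly at a death time are excluded from the death set, and the function name `cox_score_d` is hypothetical.

```python
import numpy as np

def cox_score_d(x, t, delta, s0=0.0):
    """x: (n_genes, n_patients) expression values;
    t: survival or censoring time per patient;
    delta: 1 if the patient died at time t, 0 if censored.
    Returns d(i) = r(i) / [s(i) + s0] for every gene."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    delta = np.asarray(delta, dtype=int)
    r = np.zeros(x.shape[0])
    s_squared = np.zeros(x.shape[0])
    for z in np.unique(t[delta == 1]):          # unique death times
        death = (t == z) & (delta == 1)          # death set (assumption: deaths only)
        risk = t >= z                            # risk set (assumption: inclusive)
        d_k, m_k = death.sum(), risk.sum()
        mean_death = x[:, death].mean(axis=1)
        mean_risk = x[:, risk].mean(axis=1)
        r += d_k * (mean_death - mean_risk)
        s_squared += (d_k / m_k) * ((x[:, risk] - mean_risk[:, None]) ** 2).sum(axis=1)
    return r / (np.sqrt(s_squared) + s0)

# Toy example: expression rises with survival time, so d is negative.
d = cox_score_d([[1.0, 2.0, 3.0, 4.0]], [1, 2, 3, 4], [1, 1, 1, 1])
```

A negative d(i) in this sketch indicates that patients who die early tend to have lower expression of gene (i) than those still at risk; a positive value indicates the opposite.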

SAM can be adapted for still other types of experimental data. For example, to
identify genes whose expression correlates with a quantitative parameter, such as tumor
stage, d(i) can be defined in terms of the Pearson correlation coefficient, as described in more
detail in the example below.

One example of identifying genes whose expression correlates with a continuous
parameter is identifying genes whose expression in a tumor correlates with the
survival time of the patient with the tumor.

Let x_k(i) be the expression of gene (i) in sample (k) (e.g., the kth tumor). Define
$\bar{x}(i)$ to be the average expression of gene (i) over all the samples.

Let y_k be the value of the continuous parameter (e.g., time) associated with
sample (k). Define $\bar{y}$ to be the average of the continuous parameter over all the samples.
The Pearson correlation coefficient r(i) for gene (i) is:

$$ r(i) = \frac{\sum_k [x_k(i) - \bar{x}(i)] [y_k - \bar{y}]}{\left\{ \sum_k [x_k(i) - \bar{x}(i)]^2 \sum_k (y_k - \bar{y})^2 \right\}^{1/2}} \tag{8} $$

The values of r(i) lie between -1 and +1. For r(i) ≈ +1, the
correlation is strongly positive. For r(i) ≈ -1, the correlation is strongly negative. An
example of a modified Pearson correlation coefficient that could serve as the parameter
d(i) is:

$$ d(i) = \frac{\sum_k [x_k(i) - \bar{x}(i)] [y_k - \bar{y}]}{\left\{ \sum_k [x_k(i) - \bar{x}(i)]^2 \sum_k (y_k - \bar{y})^2 \right\}^{1/2} + s_0} \tag{9} $$

The value of s_0 would be adjusted in the manner described above, thus permitting
comparison across the entire set of genes. To compute the expected d(i), the survival
times are permuted among the tumors.
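
The modified Pearson parameter and the permutation of survival times may be sketched as follows. The sketch assumes the parameter is the Pearson numerator divided by the usual Pearson denominator plus s_0; random permutations are drawn with a fixed seed, and the number of permutations (`n_perm`) and both function names are hypothetical choices made for illustration.

```python
import numpy as np

def modified_pearson_d(x, y, s0=0.0):
    """x: (n_genes, n_samples) expression values; y: continuous parameter per sample.
    With s0 = 0 this is the ordinary Pearson correlation coefficient."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    numerator = xc @ yc
    denominator = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return numerator / (denominator + s0)

def expected_d_by_permutation(x, y, s0=0.0, n_perm=100, seed=0):
    """Expected d(i): permute y among the samples, recompute d for every gene,
    rank from largest to smallest, and average rank by rank."""
    rng = np.random.default_rng(seed)
    ranked = [-np.sort(-modified_pearson_d(x, rng.permutation(y), s0))
              for _ in range(n_perm)]
    return np.mean(ranked, axis=0)

x = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 3.0, 2.0]]
y = [2.0, 4.0, 6.0]
d = modified_pearson_d(x, y)   # approximately [1.0, -1.0, 0.5]
```

Adding s_0 to the denominator shrinks d(i) toward zero for genes whose expression varies little across the samples.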

In addition to applications using the Pearson correlation coefficient, another example
is the definition of d(i) for paired data, such as gene expression in tumors before and
after chemotherapy. In each case, the FDR is estimated by random permutation of the data
for gene expression among the different experimental arms, i.e., permutations among the n
arms of y_j.



Weighting Function to Improve Data Reproducibility

For microarrays that contain several probes for each gene, expression is typically
computed as a simple mean or a trimmed mean, which eliminates contributions from
probes that diverge from the mean by a predetermined cutoff. Such methods fail to
eliminate uncertainty in the data arising from probes that do not behave appropriately, as
shown in Fig. 5A.

Data reproducibility can be improved by modifying the contribution of each probe
by a continuous weighting function. For example, the weight for probe (i) of a given
gene can be determined by a Gaussian weight function,

$$ w(i) = \exp\left[ -(x_i - x_0)^2 / a^2 \right] \tag{10} $$

where x_i = data from probe (i), x_0 = mean or median of data from all the probes for the
gene, and a = a constant multiplied by the standard deviation or median absolute deviation
of the data. When the Gaussian weight function was applied to an experiment in which the
same sample was hybridized twice to two microarrays, there was major improvement in the
data (Fig. 5B). The scatter function decreased by more than a factor of two and the
proportion of negatively expressed genes decreased from 25% to 0.001%.
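
By way of illustration, the Gaussian weighting may be sketched as follows. The text does not state how the weights are combined into a single expression value; the sketch assumes a weighted mean, uses the median for x_0 and a constant `c` times the median absolute deviation for a, and the function name `gaussian_weighted_expression` is hypothetical.

```python
import numpy as np

def gaussian_weighted_expression(probe_values, c=1.0):
    """Combine the probe values for one gene using Gaussian weights
    w(i) = exp[-(x_i - x_0)^2 / a^2]."""
    x = np.asarray(probe_values, dtype=float)
    x0 = np.median(x)                        # center: median of the probes
    a = c * np.median(np.abs(x - x0))        # scale: constant times median absolute deviation
    if a == 0:
        return float(x0)                     # at least half the probes agree exactly
    w = np.exp(-((x - x0) ** 2) / a ** 2)
    return float(np.sum(w * x) / np.sum(w))  # assumption: weighted mean of the probes

# One aberrant probe (500) barely affects the weighted value.
probes = [100.0, 102.0, 98.0, 101.0, 99.0, 500.0]
weighted = gaussian_weighted_expression(probes)
simple = float(np.mean(probes))
```

Unlike a trimmed mean, no probe is discarded outright; a probe far from the center simply receives a weight close to zero.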

Thus, SAM is a robust and straightforward method that can be adapted to a broad 
range of experimental situations. SAM and its modifications are available for use at 
http://www-stat-class.stanford.edu/SAM/SAMServlet . This web site is used at Stanford 
University. 

Software Implementation

The invention has been described above, employing methods and producing plots
as illustrated in the Figures. Such methods and graphs or plots may be produced with the
aid of machines such as computers. Therefore, another aspect of the invention involves
the software components that are loaded to a computer to perform the above-described
functions. These functions provide results with the different advantages outlined above.
The software or program components may be installed in a computer in a variety of ways.

As will be understood in the art, the inventive software components may be
embodied in a fixed media program component containing logic instructions and/or data
that, when loaded into an appropriately configured computing device, cause that device
to perform according to the invention. As will be understood in the art, a fixed media
program may be delivered to a user on a fixed media for loading in a user's computer, or a
fixed media program can reside on a remote server that a user accesses through a
communication medium in order to download a program component. Thus another
aspect of the invention involves transmitting, or causing to be transmitted, the program
component to a user where the component, when downloaded into the user's device, can
perform any one or more of the functions described above.
Fig. 6 shows an information appliance (or digital device) 40 that may be
understood as a logical apparatus that can read instructions from media 47 and/or network
port 49. Apparatus 40 can thereafter use those instructions to direct server or client logic,
as understood in the art, to embody aspects of the invention. One type of logical
apparatus that may embody the invention is a computer system as illustrated in 40,
containing CPU 44, optional input devices 49 and 41, disk drives 45 and optional monitor
46. Fixed media 47 may be used to program such a system and may represent a disk-type
optical or magnetic media, magnetic tape, solid state memory, etc. One or more aspects
of the invention may be embodied in whole or in part as software recorded on this fixed
media. Communication port 49 may also be used to initially receive instructions that are
used to program such a system to perform any one or more of the above-described
functions and may represent any type of communication connection, such as to the
internet or any other computer network. The instructions or program may be transmitted
directly to a user's device or be placed on a network, such as a website of the internet, to
be accessible through a user's device. All such methods of making the program or
software component available to users are known to those in the art and will not be
described here.

The invention also may be embodied in whole or in part within the circuitry of an
application specific integrated circuit (ASIC) or a programmable logic device (PLD). In
such a case, the invention may be embodied in a computer understandable descriptor
language which may be used to create an ASIC or PLD that operates as herein described.

While the invention has been described above by reference to various
embodiments, it will be understood that changes and modifications may be made without
departing from the scope of the invention, which is to be defined only by the appended
claims and their equivalents. All references referred to herein are incorporated by
reference in their entireties.

APPENDIX A

Table 1. Comparison of methods for identifying changes in gene expression. To
increase the stringency for calling significant changes in gene expression, parameters for
each method (Δ and R) were increased, as described in the text. The number of falsely
significant genes was estimated by permutation of the data sets. The false discovery rate
(FDR) was defined as the percentage of falsely significant genes compared to the genes
called significant.

Method                  Genes falsely significant   Genes called significant   FDR

SAM
  Δ = 0.4               134.9                       288                        47%
  Δ = 0.5               78.1                        192                        41%
  Δ = 0.6               56.1                        162                        35%
  Δ = 0.9               19.1                        80                         24%
  Δ = 1.2               8.4                         46                         18%

Fold change
  R = 2.0               283.1                       348                        81%
  R = 2.5               137.8                       169                        82%
  R = 3.0               76.8                        99                         78%
  R = 3.5               46.7                        64                         73%
  R = 4.0               29.3                        35                         84%

Pairwise fold change
  R = 1.2               245.6                       355                        69%
  R = 1.3               155.4                       220                        71%
  R = 1.5               76.2                        118                        65%
  R = 1.7               44.8                        70                         64%
  R = 2.0               22.8                        38                         60%

APPENDIX B 

Table 2. Genes with changes in expression called significant by SAM.
* To compute r(i) = x_I(i)/x_U(i), negative levels of expression were reset to a value of 10.
† Genes previously reported to respond transcriptionally to ionizing radiation.
Gene functions: Black = cell cycle; Dark gray = apoptosis; Light gray = DNA repair

Rank   Accession   d(i)    r(i)     s(i)   x_U(i)   x_I(i)

INDUCED GENES
1      U09579 †    9.2     3.4      158    633      2119
2      X83490 †    7.7     2.5      26     155      381
3      U47621      7.1     2.9      13     61       178
4      U18300 †    6.9     1.9      59     448      869
5      U48296      6.7     1.6      30     354      583
6      X63717 †    6.6     2.2      43     254      561
7      D21089      5.5     2.4      96     392      930
8      U39400      5.4     1.7      41     349      581
9      X77794 †    5.3     1.6      99     964      1499
10     M60974 †    5.1     2.5      58     203      516
11     D90224      4.9     2.8      32     96       270
12     U25138      4.9     9.0*     16     -4       90
13     J05614 †    4.8     1.7      352    2358     4043
14     X83492 †    4.8     1.8      17     117      213
15     X85116      4.5     2.0      15     83       163
16     U50136      4.4     1.6      26     231      359
17     U82987      4.2     33.6*    161    -4       336
18     M92424 †    4.0     2.8      16     45       125

REPRESSED GENES
1      U01038      -9.8    0.50     25     551      275
2      M91670      -9.3    0.61     25     693      425
3      U68233      -9.3    0.48     12     275      133
4      U05340      -6.9    0.39     54     642      253
5      M25753 †    -5.8    0.23     43     345      78
6      X97267      -5.8    0.49     69     818      400
7      S78187 †    -5.5    0.63     20     357      224
8      X54942      -5.5    0.52     87     1026     534
9      U63743      -5.0    0.59     19     265      158
10     D86973      -4.9    0.64     16     264      168
11     X62048 †    -4.9    0.44     8      99       43
12     M80359      -4.9    0.40*    6      25       -19
13     U28386      -4.8    0.58     77     928      541
14     X02910      -4.8    0.37     20     170      63
15     D31764      -4.7    0.26     36     247      64
16     X67155      -4.4    0.30     28     199      60
17     HG3523      -4.4    0.51     40     391      200
18     Z15005      -4.2    0.28*    10     36

Gene names (row alignment not preserved in this copy):
No55 nucleolar autoantigen
NOF1
cyclin G1
OX40 ligand, TNF ligand superfamily
maxi K potassium channel beta subunit
EPB72, integral membrane protein
leukotriene C4 synthase (LTC4S)
PLK, polo kinase homolog
ubiquitin carrier protein (E2-EPF)
HRR-1 farnesol receptor
p55cdc: present in dividing cells
cyclin B
lymphocyte phosphatase assoc phosphopr
cdc25 phosphatase
ckshs2, cks1 protein homolog
MCAK, mitotic centromere-associated kin
GCN1, translational regulator of GCN4
C-TAK1, cdc25c associated protein kinase
hSRP1α, NLS receptor
hEphB1b, Eph-like receptor tyrosine kinase
MKLP-1, mitotic kinesin-like protein-1
c-Myc, alternate splice form 3
CENP-E putative kinetochore motor